A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem
Kannan, Sampath, Morgenstern, Jamie H., Roth, Aaron, Waggoner, Bo, Wu, Zhiwei Steven
Bandit learning is characterized by the tension between long-term exploration and short-term exploitation. However, as has recently been noted, in settings in which the choices of the learning algorithm correspond to important decisions about individual people (such as criminal recidivism prediction, lending, and sequential drug trials), exploration corresponds to explicitly sacrificing the well-being of one individual for the potential future benefit of others. In such settings, one might like to run a ``greedy'' algorithm, which always makes the optimal decision for the individuals at hand --- but doing this can result in a catastrophic failure to learn. In this paper, we consider the linear contextual bandit problem and revisit the performance of the greedy algorithm. We give a smoothed analysis, showing that even when contexts may be chosen by an adversary, small perturbations of the adversary's choices suffice for the algorithm to achieve ``no regret'', perhaps (depending on the specifics of the setting) with a constant amount of initial training data. This suggests that in slightly perturbed environments, exploration and exploitation need not be in conflict in the linear setting.
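To make the object of study concrete, here is a minimal sketch (our illustration, not the authors' code) of the greedy algorithm in the setting where each arm has its own unknown parameter vector. Names such as greedy_linear_bandit are ours, and the ridge term reg loosely stands in for the constant amount of initial training data mentioned above.

```python
import numpy as np

def greedy_linear_bandit(contexts, thetas, noise_sd=0.1, reg=1.0, seed=0):
    """Greedy policy: each round, pull the arm whose least-squares reward
    estimate on the current context is largest. No exploration bonus."""
    rng = np.random.default_rng(seed)
    T, K, d = contexts.shape                      # rounds, arms, context dim
    A = np.tile(reg * np.eye(d), (K, 1, 1))       # per-arm ridge Gram matrices
    b = np.zeros((K, d))                          # per-arm response vectors
    inst_regret = np.zeros(T)
    for t in range(T):
        X = contexts[t]                           # (K, d): one context per arm
        est = np.array([X[k] @ np.linalg.solve(A[k], b[k]) for k in range(K)])
        k = int(np.argmax(est))                   # exploit-only arm choice
        r = X[k] @ thetas[k] + rng.normal(0.0, noise_sd)
        A[k] += np.outer(X[k], X[k])              # rank-one least-squares update
        b[k] += r * X[k]
        means = np.einsum("kd,kd->k", X, thetas)  # true per-arm mean rewards
        inst_regret[t] = means.max() - means[k]
    return np.cumsum(inst_regret)
```

Solving the regularized normal equations from scratch each round keeps the sketch short; a practical implementation would maintain the inverse incrementally, e.g. via Sherman-Morrison updates.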
Reviews: A Smoothed Analysis of the Greedy Algorithm for the Linear Contextual Bandit Problem
This paper is about linear contextual bandits. The authors aim to understand the phenomenon, observed in practice, that exploration seems much less important than the theory usually suggests. As a first step, they analyze the greedy policy, which chooses the arm maximizing the inner product between the least-squares estimate and the context. With no further assumptions this leads to linear regret. The authors therefore make a 'perturbation' assumption under which the contexts are first chosen by an adversary and then perturbed by zero-mean noise. Under carefully chosen assumptions on the noise, they show that the greedy algorithm enjoys O(sqrt(T)) regret. Both standard settings (a single parameter shared across arms, and a separate parameter per arm) are studied.
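The perturbation model the review describes can be simulated directly, reusing greedy_linear_bandit from the sketch above. Here i.i.d. base contexts stand in for the adversary's possibly adaptive choices, and sigma is the smoothing scale; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
T, K, d, sigma = 5000, 5, 10, 0.1

thetas = rng.normal(size=(K, d))
thetas /= np.linalg.norm(thetas, axis=1, keepdims=True)  # unit-norm arm params

# Base contexts stand in for the adversary's choices; a real adversary could
# pick them adaptively, which plain i.i.d. sampling does not capture.
base = rng.normal(size=(T, K, d))
base /= np.linalg.norm(base, axis=2, keepdims=True)
contexts = base + sigma * rng.normal(size=(T, K, d))     # zero-mean smoothing

cum_regret = greedy_linear_bandit(contexts, thetas)      # sketch above
print(cum_regret[-1] / np.sqrt(T))  # roughly flat in T if regret is O(sqrt(T))
```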
Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits
Zhang, Zihan, Ji, Xiangyang, Zhou, Yuan
We study the optimal batch-regret tradeoff for batch linear contextual bandits. For any batch number $M$, number of actions $K$, time horizon $T$, and dimension $d$, we provide an algorithm and prove its regret guarantee, which, for technical reasons, features a two-phase expression as the time horizon $T$ grows. We also prove a lower bound theorem that, surprisingly, shows the optimality of our two-phase regret upper bound (up to logarithmic factors) in the \emph{full range} of the problem parameters, therefore establishing the exact batch-regret tradeoff. Compared to the recent work \citep{ruan2020linear}, which showed that $M = O(\log \log T)$ batches suffice to achieve the asymptotically minimax-optimal regret without batch constraints, our algorithm is simpler and easier to implement in practice. Furthermore, our algorithm achieves the optimal regret for all $T \geq d$, while \citep{ruan2020linear} requires $T$ to be greater than an unrealistically large polynomial of $d$. Along the way, we also prove a new matrix concentration inequality with dependence on the matrices' dynamic upper bounds, which, to the best of our knowledge, is the first of its kind in the literature and may be of independent interest.
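The batch constraint itself is easy to state in code. The sketch below is not the paper's algorithm; it only illustrates what a budget of $M$ batches means: the policy is frozen within each batch (a greedy policy here, for brevity) and estimates are refreshed only at the batch boundaries, on a uniform grid rather than the carefully designed schedule the paper's guarantees require.

```python
import numpy as np

def batched_greedy(contexts, thetas, M, noise_sd=0.1, reg=1.0, seed=0):
    """Illustration of the batch constraint only (not the paper's algorithm):
    the policy is frozen within each of M batches, and least-squares
    estimates are refreshed only at batch boundaries."""
    rng = np.random.default_rng(seed)
    T, K, d = contexts.shape
    boundaries = np.linspace(0, T, M + 1, dtype=int)   # fixed uniform grid
    A = np.tile(reg * np.eye(d), (K, 1, 1))
    b = np.zeros((K, d))
    regret = 0.0
    for m in range(M):
        # Estimates computed once per batch and frozen until the next boundary.
        theta_hat = np.stack([np.linalg.solve(A[k], b[k]) for k in range(K)])
        for t in range(boundaries[m], boundaries[m + 1]):
            X = contexts[t]
            k = int(np.argmax(np.einsum("kd,kd->k", X, theta_hat)))
            r = X[k] @ thetas[k] + rng.normal(0.0, noise_sd)
            A[k] += np.outer(X[k], X[k])   # feedback collected within the batch
            b[k] += r * X[k]               # ...but only used at the boundary
            means = np.einsum("kd,kd->k", X, thetas)
            regret += means.max() - means[k]
    return regret
```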